NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores

Valpey, Benjamin; Li, Xinyi; Pai, Sreepathi; Gopalakrishnan, Ganesh (June 2025, NASA Formal Methods)
Titolo, Laura (Ed.)
Many recent computational accelerators provide non-standard (e.g., reduced precision) arithmetic operations to enhance performance for floating-point matrix multiplication. Unfortunately, the properties of these accelerators are not widely understood and lack sufficient descriptions of their behavior. This makes it difficult for tool builders beyond the original vendor to target or simulate the hardware correctly, or for algorithm designers to be confident in their code. To address these gaps, prior studies have probed the behavior of these units with manually crafted tests. Such tests are cumbersome to design, and adapting them as the accelerators evolve requires repeated manual effort. We present a formal model for the tensor cores of NVIDIA’s Volta, Turing, and Ampere GPUs. We identify specific properties—rounding mode, precision, and accumulation order—that drive these cores’ behavior. We formalize these properties and then use the formalization to automatically generate discriminating inputs that illustrate differences among machines. Our results confirm many of the findings of previous tensor core studies, but also identify subtle disagreements. In particular, NVIDIA’s machines do not, as previously reported, use round-to-zero for accumulation, and their 5-term accumulator requires 3 extra carry-out bits for full accuracy. Using our formal model, we analyze two existing algorithms that use half-precision tensor cores to accelerate single-precision multiplication with error correction. Our analysis reveals that the newer algorithm, designed to be more accurate than the first, is actually less accurate for certain inputs.
more » « less
Free, publicly-accessible full text available June 12, 2026
Modeling Utilization to Identify Shared-Memory Atomic Bottlenecks

https://doi.org/10.1145/3725798.3725801

Dong, Rongcui; Pai, Sreepathi (March 2025, ACM)

Performance analysis is critical for GPU programs with data-dependent behavior, but models like Roofline are not very useful for them and interpreting raw performance counters is tedious. In this work, we present an analytical model for shared memory atomics (fetch-and-op and compare-and-swap instructions on NVIDIA Volta and Ampere GPU) that allows users to immediately determine if shared memory atomic operations are a bottleneck for a program’s execution. Our model is based on modeling the architecture as a single-server queuing model whose inputs are performance counters. It captures load-dependent behavior such as pipelining, parallelism, and different access patterns. We embody this model in a tool that uses CUDA hardware counters as parameters to predict the utilization of the shared-memory atomic unit. To the best of our knowledge, no existing profiling tool or model provides this capability for shared-memory atomic operations. We used the model to compare two histogram kernels that use shared-memory atomics. Although nearly identical, their performance can be different by up to 30%. Our tool correctly identifies a bottleneck shift from shared-memory atomic unit as the cause of this discrepancy
more » « less
Free, publicly-accessible full text available March 1, 2026
An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores

https://doi.org/10.1007/978-3-031-93706-4_21

Valpey, Benjamin; Li, Xinyi; Pai, Sreepathi; Gopalakrishnan, Ganesh (January 2025, Springer Nature Switzerland)

Full Text Available
Asynchronous Automata Processing on GPUs

https://doi.org/10.1145/3579453

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait (March 2023, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1)~input stream level, 2)~automaton-level and 3)~state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup.
more » « less
Why GPUs are Slow at Executing NFAs and How to Make them Faster

https://doi.org/10.1145/3373376.3378471

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait (March 2020, Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems)
null (Ed.)
Full Text Available
Statistical caching for near memory management

https://doi.org/10.1145/3357526.3357557

Chen, Dong; Liu, Fangzhou; Jiao, Mingyang; Ding, Chen; Pai, Sreepathi (September 2019, Proceedings of the International Symposium on Memory Systems)

Modern GPUs often use near memory or high-bandwidth memory, which may be managed as cache when the application data is too large to fit in the near memory. Unlike CPU caches, the near memory cache has a much larger size. A recent approach is statistical caching, which shows near optimal results when managing large memory for file caching. The prior work is ideal and not practical. This paper outlines two extensions. It first formulates a new caching algorithm called least expected use (LEU) replacement and shows, through examples, that the statistical solution automatically integrates two otherwise disparate policies. Then the paper describes a system design to implement LEU. To position the new design for discussion, the paper draws parallels with two familiar ideas, branch prediction and spectral analysis, and considers a set of opportunities and challenges of achieving statistical caching in near memory.
more » « less
Full Text Available
Architectural Support for Efficient Large-Scale Automata Processing

https://doi.org/10.1109/MICRO.2018.00078

Liu, Hongyuan; Ibrahim, Mohamed; Kayiran, Onur; Pai, Sreepathi; Jog, Adwait (October 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))

Full Text Available
Locality analysis through static parallel sampling

https://doi.org/10.1145/3192366.3192402

Chen, Dong; Liu, Fangzhou; Ding, Chen; Pai, Sreepathi (January 2018, PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation)

Full Text Available
Bounded exhaustive test-input generation on GPUs

https://doi.org/10.1145/3133918

Celik, Ahmet; Pai, Sreepathi; Khurshid, Sarfraz; Gligoric, Milos (October 2017, Proceedings of the ACM on Programming Languages)

Full Text Available
Controlled Kernel Launch for Dynamic Parallelism in GPUs

https://doi.org/10.1109/HPCA.2017.14

Tang, Xulong; Pattnaik, Ashutosh; Jiang, Huaipan; Kayiran, Onur; Jog, Adwait; Pai, Sreepathi; Ibrahim, Mohamed; Kandemir, Mahmut T.; Das, Chita R. (February 2017, IEEE International Symposium on High Performance Computer Architecture (HPCA))

Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware-limits. To address these two concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically-generated kernels, thereby directly reducing the associated launch overheads and queuing latency. Moreover, it allows a better mix of dynamically-generated and original (parent) kernels for the scheduler to effectively hide the remaining overheads and improve the utilization of the GPU resources. Our results show that, across 13 benchmarks, SPAWN achieves 69% and 57% speedup over the flat (non-DP) implementation and baseline DP, respectively.
more » « less
Full Text Available

Search for: All records